    EBL-Hope: Multilingual Word Sense Disambiguation Using a Hybrid Knowledge-Based Technique

    We present a hybrid knowledge-based approach to multilingual word sense disambiguation using BabelNet. Our approach combines a modified version of the Lesk algorithm with the Jiang & Conrath similarity measure. We describe our system's runs for the word sense disambiguation subtask of the Multilingual Word Sense Disambiguation and Entity Linking task at SemEval 2015, where the system ranked 9th among the participating systems for English.
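
    The combination the abstract describes can be illustrated with a short sketch. The following is a minimal, hypothetical scorer built on NLTK's WordNet interface rather than BabelNet; the blending parameter alpha and the first-sense treatment of context words are illustrative assumptions, and the paper's actual modifications to Lesk are not reproduced here.

        # A minimal hybrid Lesk + Jiang & Conrath sketch (assumes the NLTK
        # 'wordnet' and 'wordnet_ic' corpora are downloaded; alpha is assumed).
        from nltk.corpus import wordnet as wn, wordnet_ic
        from nltk.corpus.reader.wordnet import WordNetError

        brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content from the Brown corpus

        def gloss_overlap(sense, context_words):
            # Simplified Lesk: count context words appearing in the sense's gloss.
            return len(set(sense.definition().lower().split()) & context_words)

        def jcn_to_context(sense, context_senses):
            # Total Jiang & Conrath similarity between a sense and the context senses.
            total = 0.0
            for other in context_senses:
                if sense.pos() == other.pos() and sense.pos() in ('n', 'v'):
                    try:
                        total += sense.jcn_similarity(other, brown_ic)
                    except WordNetError:
                        pass  # no shared information-content root
            return total

        def disambiguate(target, context, alpha=0.5):
            # Pick the sense maximising the blended Lesk + JCN score.
            context_words = {w.lower() for w in context}
            context_senses = [wn.synsets(w)[0] for w in context if wn.synsets(w)]
            return max(wn.synsets(target),
                       key=lambda s: alpha * gloss_overlap(s, context_words)
                                     + (1 - alpha) * jcn_to_context(s, context_senses))

        print(disambiguate('bank', ['river', 'water', 'flow']))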

    A Context-Adaptive Ranking Model for Effective Information Retrieval System

    When using Information Retrieval (IR) systems, users often present search queries made of ad hoc keywords. It is then up to the information retrieval system (IRS) to obtain a precise representation of the user's information need and the context of the information. Context-aware ranking techniques have been used consistently in recent years to improve users' search interactions and the relevance of retrieved documents. Although there have been major advances in context-adaptive systems, there is still a lack of techniques that model and implement context-adaptive applications. This paper addresses that problem with the DROPT technique, which ranks documents against individual users' information needs according to relevance weights. Our predictive document-ranking model is computed from measures of an individual user's searches within their domain of knowledge. The context of a query determines the relevance of retrieved information, so relevant context aspects should be incorporated in a way that supports the knowledge domain representing the user's interests. We demonstrate the ranking task using metric measures and ANOVA, and argue that it can help an IRS adapt to a user's interaction behaviour, using context to improve IR effectiveness.
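
    The abstract does not give the DROPT formula, so the following is only a hedged sketch of context-adaptive re-ranking in its spirit: documents arriving with a base retrieval score are re-weighted by how well their index keywords match a user's interest profile. The Doc structure, the set-based profile and the mixing weight are illustrative assumptions, not the paper's actual model.

        # Hypothetical context-adaptive re-ranking sketch; not the DROPT formula itself.
        from dataclasses import dataclass

        @dataclass
        class Doc:
            doc_id: str
            base_score: float   # score from the underlying retrieval model, e.g. BM25
            keywords: set       # index keywords of the document

        def context_weight(doc, user_profile):
            # Fraction of the document's keywords in the user's domain of interest.
            return len(doc.keywords & user_profile) / len(doc.keywords) if doc.keywords else 0.0

        def rerank(docs, user_profile, mix=0.3):
            # Blend base relevance with the user-context weight, then sort descending.
            key = lambda d: (1 - mix) * d.base_score + mix * context_weight(d, user_profile)
            return sorted(docs, key=key, reverse=True)

        docs = [Doc('d1', 0.82, {'ranking', 'retrieval'}),
                Doc('d2', 0.90, {'cooking', 'recipes'})]
        profile = {'ranking', 'retrieval', 'relevance'}
        print([d.doc_id for d in rerank(docs, profile)])  # ['d1', 'd2']: context outweighs base score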

    An optimized Lesk-based algorithm for word sense disambiguation

    Computational complexity is a characteristic of almost all Lesk-based algorithms for word sense disambiguation (WSD). In this paper, we address this issue by developing a simple and optimized variant of the algorithm using topic composition in documents, based on the theory underlying topic models. The knowledge resource adopted is the English WordNet, enriched with linguistic knowledge from Wikipedia and the SemCor corpus. Besides the algorithm's efficiency, we also evaluate its effectiveness using two datasets: a general-domain dataset and a domain-specific dataset. The algorithm achieves superior performance on the general-domain dataset and superior performance among knowledge-based techniques on the domain-specific dataset.
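
    To make the optimization concrete, here is a minimal sketch under one plausible reading of the abstract: each candidate gloss is compared once against a set of topic words for the whole document, giving one comparison per sense instead of classic Lesk's one per (sense, context sense) pair. The frequency-based topic proxy below is an assumption; the paper's actual topic-model machinery and enriched WordNet are not reproduced.

        # Hypothetical topic-composition variant of Lesk (assumes NLTK's WordNet corpus).
        from collections import Counter
        from nltk.corpus import wordnet as wn

        def document_topic_words(doc_tokens, top_k=50):
            # Crude stand-in for a topic model: the document's most frequent terms.
            return {w for w, _ in Counter(doc_tokens).most_common(top_k)}

        def disambiguate(word, topic_words):
            # One gloss-vs-topic comparison per sense, not per gloss pair.
            senses = wn.synsets(word)
            if not senses:
                return None
            return max(senses,
                       key=lambda s: len(set(s.definition().lower().split()) & topic_words))

        doc = "the bank raised interest rates and approved the loan".split()
        print(disambiguate('bank', document_topic_words(doc)))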

    A Combined Unsupervised Technique for Automatic Classification in Electronic Discovery

    Electronic data discovery (EDD), e-discovery or eDiscovery is any process by which electronically stored information (ESI) is sought, identified, collected, preserved, secured, processed and searched for material relevant to civil and/or criminal litigation or regulatory matters, with the intention of using it as evidence. Searching electronic document collections for relevant documents is the part of eDiscovery that poses serious problems for lawyers and their clients alike. Finding efficient and effective techniques for search in eDiscovery remains an open problem in the field of legal information systems. Researchers are shifting away from traditional keyword search towards more intelligent approaches such as machine learning (ML). State-of-the-art algorithms for search in eDiscovery focus mainly on supervised approaches, namely supervised learning and interactive approaches. The former uses labelled examples to train systems, while the latter uses human assistance during the search process to help retrieve relevant documents; its techniques include interactive query expansion, among others. Both are supervised forms of technology-assisted review (TAR), the use of technology to assist or completely automate the process of searching for and retrieving relevant documents from ESI. In text retrieval/classification, supervised systems are known for their superior performance over unsupervised systems. However, two serious issues limit their application in eDiscovery search and in information retrieval (IR) in general. First, they carry high costs in finance and human effort, which is largely responsible for the huge amounts of money spent on eDiscovery annually. Secondly, their case- or project-specific nature does not allow for reuse, adding further to organizations' expenses when they have two or more cases involving eDiscovery.
    Unsupervised systems, on the other hand, are cost-effective in terms of finance and human effort. A major challenge in unsupervised ad hoc information retrieval is the vocabulary problem, which causes term mismatch between queries and documents. Topic-modelling techniques tackle this from a thematic point of view, in the sense that a query and a document are likely to match if they discuss the same topic, while natural language processing (NLP) approaches view it from a semantic perspective. Scalable topic-modelling algorithms, like the traditional bag-of-words technique, suffer from polysemy and synonymy problems. NLP techniques, while able to resolve polysemy and synonymy considerably, are computationally expensive and unsuitable for large collections such as those in eDiscovery. In this thesis, we exploit the peculiarity of eDiscovery collections, which are composed mainly of e-mail communications and their attachments: mining topics of discourse from e-mails and disambiguating these topics and the queries for term matching proves effective for retrieving relevant documents compared with traditional stem-based retrieval. In this work, we present an automated unsupervised approach for retrieval/classification in eDiscovery.
    This approach is an ad hoc retrieval method that creates a representative for each original document in the collection using a latent Dirichlet allocation (LDA) model with Gibbs sampling, and explores word sense disambiguation (WSD) to give these representative documents and the queries deeper meaning for distributional semantic similarity. The WSD technique is itself a hybrid algorithm derived from a modified version of the original Lesk algorithm and the Jiang & Conrath similarity measure. The technique was evaluated on the TREC Legal Track; results and observations are discussed in Chapter 8. We conclude that WSD can improve ad hoc retrieval effectiveness. Finally, we suggest further work on efficient WSD algorithms, which could further improve retrieval effectiveness if applied to the original document collections rather than the representative collections.
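
    As a rough illustration of the representative-document step, the sketch below builds an LDA model over tokenized e-mails and replaces each message with the top words of its dominant topics, which would then be disambiguated and indexed. It assumes gensim, whose LdaModel uses variational inference rather than the Gibbs sampling used in the thesis, so it only shows the shape of the pipeline, not the thesis implementation.

        # Illustrative pipeline sketch (assumes gensim; variational LDA, not Gibbs).
        from gensim import corpora, models

        emails = ["quarterly earnings report attached please review".split(),
                  "settlement confirmation for the gas trading contract".split()]

        dictionary = corpora.Dictionary(emails)
        bow = [dictionary.doc2bow(doc) for doc in emails]
        lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

        def representative(doc_bow, top_words=5):
            # Replace a document with the top words of its dominant topics.
            words = []
            for topic_id, _weight in lda.get_document_topics(doc_bow):
                words += [w for w, _ in lda.show_topic(topic_id, top_words)]
            return words

        print(representative(bow[0]))  # compact surrogate document for WSD and indexing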

    Algorithm for Information Retrieval Optimization

    When using Information Retrieval Systems (IRS), users often present search queries made of ad hoc keywords. It is then up to the IRS to obtain a precise representation of the user's information need and the context of the information. This paper investigates optimizing an IRS for individual information needs, ranked in order of relevance, and addresses the development of algorithms that optimize the ranking of documents retrieved from an IRS. It discusses and describes a Document Ranking Optimization (DROPT) algorithm for information retrieval (IR) in an Internet-based or designated-database environment. As the volume of information available online and in designated databases grows continuously, ranking algorithms can play a major role in the presentation of search results. In this paper, a DROPT technique for documents retrieved from a corpus is developed with respect to document index keywords and the query vectors. This is based on calculating the weight
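
    The abstract breaks off at the weight calculation, so the following is only a hedged guess at the kind of keyword weighting involved, assuming a standard tf-idf scheme over document index keywords and query terms; the actual DROPT weight formula may differ.

        # Assumed tf-idf keyword weighting and query scoring; not the DROPT formula.
        import math

        def tfidf_weights(doc_tokens, doc_freq, n_docs):
            # Weight per index keyword: term frequency scaled by rarity in the corpus.
            weights = {}
            for term in set(doc_tokens):
                tf = doc_tokens.count(term) / len(doc_tokens)
                idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
                weights[term] = tf * idf
            return weights

        def rank_score(doc_weights, query_terms):
            # Document score for a query: sum of its weights over the query terms.
            return sum(doc_weights.get(t, 0.0) for t in query_terms)

        doc = "ranking algorithms improve retrieval ranking".split()
        w = tfidf_weights(doc, doc_freq={'ranking': 3, 'retrieval': 5}, n_docs=100)
        print(rank_score(w, ['ranking', 'retrieval']))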